Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms

Author

Jimmy J. Lin

Abstract

The purpose of this short paper is to share a recent observation I made in the context of my introductory graduate course on MapReduce at the University of Maryland. It is well known that since the sort/shuffle stage in MapReduce is costly, local aggregation is one important principle in designing efficient algorithms. This typically involves using combiners or the so-called in-mapper combining technique [5]. However, can we be more precise in formulating this design principle for pedagogical purposes? Simply saying "use combiners" or "use in-mapper combining" is unsatisfying because it leaves open the obvious question of how. What follows is my attempt to formulate a more precise design principle in terms of monoids. The idea is quite simple, but I haven't seen anyone else make this observation before in the context of MapReduce. Let me illustrate with a running example I often use to teach MapReduce algorithm design, which is detailed in Lin and Dyer [5]. Given a large number of key–value pairs where the keys are strings and the values are integers, we wish to find the average of all the values by key. In SQL, this is accomplished with a simple group-by and Avg. Here is the naïve MapReduce algorithm:
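The listing the abstract refers to is not included on this page. The following is instead a minimal sketch of the monoid idea applied to the running example, written as a plain-Python simulation rather than real MapReduce code. It assumes one common way to make averaging locally aggregatable: carrying (sum, count) pairs, which form a monoid under component-wise addition with identity (0, 0). All function names here (`combine`, `mapper`, `local_aggregate`, `reducer`, `average_by_key`) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict

# Identity element of the assumed (sum, count) monoid.
ZERO = (0, 0)

def combine(a, b):
    """Associative operation: add sums and counts component-wise."""
    return (a[0] + b[0], a[1] + b[1])

def mapper(key, value):
    """Lift each integer value into the monoid as a (sum, count) pair."""
    return key, (value, 1)

def local_aggregate(pairs):
    """Stand-in for a combiner or in-mapper combining on one input split."""
    partial = defaultdict(lambda: ZERO)
    for key, value in pairs:
        k, m = mapper(key, value)
        partial[k] = combine(partial[k], m)
    return dict(partial)

def reducer(partials):
    """Merge partial aggregates from all splits, then divide once at the end."""
    merged = defaultdict(lambda: ZERO)
    for partial in partials:
        for key, m in partial.items():
            merged[key] = combine(merged[key], m)
    return {key: s / c for key, (s, c) in merged.items()}

def average_by_key(splits):
    """Simulate the whole job: local aggregation per split, then a reduce."""
    return reducer(local_aggregate(split) for split in splits)

# Two simulated input splits of (string key, integer value) pairs.
splits = [[("a", 1), ("a", 2)], [("a", 6), ("b", 4)]]
print(average_by_key(splits))  # {'a': 3.0, 'b': 4.0}
```

Averaging the per-split averages for key "a" would instead give (1.5 + 6) / 2 = 3.75, which is wrong. Because `combine` is associative and has an identity, partial results can be merged in any grouping, which is what makes local aggregation safe in this sketch.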


Similar resources

Space-Efficient Bimachine Construction Based on the Equalizer Accumulation Principle

Algorithms for building bimachines from functional transducers found in the literature in a run of the bimachine imitate one successful path of the input transducer. Each single bimachine output exactly corresponds to the output of a single transducer transition. Here we introduce an alternative construction principle where bimachine steps take alternative parallel transducer paths into account...


MapReduce Algorithms for Big Data Analysis

There is a growing trend of applications that should handle big data. However, analyzing big data is a very challenging problem today. For such applications, the MapReduce framework has recently attracted a lot of attention. Google’s MapReduce or its open-source equivalent Hadoop is a powerful tool for building such applications. In this tutorial, we will introduce the MapReduce framework based...


Parallel Decision Tree with Application to Water Quality Data Analysis

Decision tree is a popular classification technique in many applications, such as retail target marketing, fraud detection and design of telecommunication service plans. With the information exploration, the existing classification algorithms are not good enough to tackle large data set. In order to deal with the problem, many researchers try to design efficient parallel classification algorith...


A MapReduce and MPI Programming Model for Distributed Large Scale 3D Mesh Processing

Developing a high performance platform for large-scale, high-intensity data processing is a priority for researching cost-effective parallel finite element methods (FEM). This paper introduces an efficient MapReduce-MPI based strategy for parallel 3D finite element mesh processing, demonstrates the potential benefits of this approach for optimally utilizing system resources. Preliminary experim...


Sorting, Searching, and Simulation in the MapReduce Framework

In this paper, we study the MapReduce framework from an algorithmic standpoint and demonstrate the usefulness of our approach by designing and analyzing efficient MapReduce algorithms for fundamental sorting, searching, and simulation problems. This study is motivated by a goal of ultimately putting the MapReduce framework on an equal theoretical footing with the well-known PRAM and BSP paralle...



Journal: CoRR

Volume: abs/1304.7544

Publication year: 2013
